Imbalanced data classification algorithm based on ball cluster partitioning and undersampling with density peak optimization

Xuewen LIU, Jikui WANG, Zhengguo YANG, Qiang LI, Jihai YI, Bing LI, Feiping NIE

Journal of Computer Applications 2022, 42 (5): 1455-1463. DOI: 10.11772/j.issn.1001-9081.2021050736

Integrating cost-sensitive learning and resampling into ensemble algorithms is an effective hybrid strategy for imbalanced data classification. Existing hybrid methods, however, give little consideration to the intra-class and inter-class distributions of samples when calculating misclassification costs and undersampling. To address this problem, an imbalanced data classification algorithm based on ball cluster partitioning and undersampling with density peak optimization was proposed, named Boosting algorithm based on Ball Cluster Partitioning and UnderSampling with Density Peak optimization (DPBCPUSBoost). Firstly, density peak information was used to define the sampling weights of majority samples, and each majority ball cluster with a "neighbor cluster" was divided into an "easy-to-misclassify area" and a "hard-to-misclassify area"; the sampling weights of samples in the "easy-to-misclassify area" were then increased. Secondly, the majority samples were undersampled according to the sampling weights in the first iteration and according to the sample distribution weights in each subsequent iteration, and a weak classifier was trained on a temporary training set combining the undersampled majority samples with all minority samples. Finally, the density peak information of the samples was combined with their class distribution to define different misclassification costs for all samples, and the weights of samples with higher misclassification costs were increased by the cost adjustment function.
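The first two steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weight formula or the ball cluster partitioning, so the function names, the `cutoff` radius, the `boost` factor, and the rule "a majority sample close to any minority sample counts as easy to misclassify" are all assumptions made for illustration. Only the two density peak quantities (local density and distance to the nearest denser point) follow the standard density peak definition.

```python
import math
import random


def density_peak_scores(points, cutoff):
    """Return (rho, delta) for each point, in the sense of density peak clustering.

    rho[i]   = number of other points closer than `cutoff` (local density).
    delta[i] = distance to the nearest point with strictly higher density;
               for points with no denser point, the largest distance to any point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def sampling_weights(majority, minority, rho, cutoff, boost=2.0):
    """Hypothetical weight: 1 + local density, multiplied by `boost` when the
    sample lies within `cutoff` of a minority sample (an assumed stand-in for
    the "easy-to-misclassify area")."""
    weights = []
    for p, r in zip(majority, rho):
        w = 1.0 + r
        if min(math.dist(p, m) for m in minority) < cutoff:
            w *= boost
        weights.append(w)
    return weights


def weighted_undersample(indices, weights, k, rng):
    """Draw up to k distinct indices with probability proportional to weight."""
    pool = list(indices)
    w = list(weights)
    chosen = []
    for _ in range(min(k, len(pool))):
        pick = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(pick))
        w.pop(pick)
    return chosen


# Toy data: six majority samples, two minority samples.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0), (2.1, 2.0), (5.0, 5.0)]
minority = [(2.2, 2.1), (2.3, 1.9)]

rho, delta = density_peak_scores(majority, cutoff=0.5)
weights = sampling_weights(majority, minority, rho, cutoff=0.5)

# Undersample the majority class down to the minority class size.
rng = random.Random(0)
chosen = weighted_undersample(range(len(majority)), weights, len(minority), rng)
temporary_training_set = [majority[i] for i in chosen] + minority
```

In a boosting loop, the temporary training set would be rebuilt in each round, with the draw driven by the current sample distribution weights rather than the initial sampling weights.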
Experimental results on 10 KEEL datasets indicate that DPBCPUSBoost achieves the highest performance on more datasets than imbalanced data classification algorithms such as Adaptive Boosting (AdaBoost), Cost-sensitive AdaBoost (AdaCost), Random UnderSampling Boosting (RUSBoost) and UnderSampling and Cost-sensitive Boosting (USCBoost), in terms of evaluation metrics such as Accuracy, F1-Score, Geometric Mean (G-mean) and Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC). These results verify that the definitions of sample misclassification cost and sampling weight in the proposed DPBCPUSBoost are effective.
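For readers unfamiliar with the metrics, the sketch below computes Accuracy, F1-Score and G-mean from a binary confusion matrix using their standard definitions. The helper names and the toy labels are illustrative assumptions, not data from the paper; the minority class is taken as the positive class.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn) with `positive` as the minority class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fn, fp, tn


def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def imbalance_metrics(y_true, y_pred, positive=1):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    recall = safe_div(tp, tp + fn)        # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fn + fp + tn),
        "f1": safe_div(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }


# Toy example: 2 minority (1) and 8 majority (0) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = imbalance_metrics(y_true, y_pred)
```

On this toy example accuracy is 0.8 even though only half of the minority samples are recovered, which is why F1-Score and G-mean are reported alongside Accuracy for imbalanced data.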